{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LAB 1a:  Exploring natality dataset.\n",
    "\n",
    "**Learning Objectives**\n",
    "\n",
    "1. Use BigQuery to explore natality dataset\n",
    "1. Use Cloud AI Platform Notebooks to plot data explorations\n",
    "\n",
    "\n",
    "## Introduction \n",
    "Before we begin model development and training, in this notebook we will explore the natality dataset to predict the weight of a baby before it is born.  We will use BigQuery to explore the data and use Cloud AI Platform Notebooks to plot data explorations.\n",
    "\n",
    "Each learning objective will correspond to a __#TODO__ in this student lab notebook -- try to complete this notebook first and then review the [solution notebook](../solutions/1a_explore_data_babyweight.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "hJ7ByvoXzpVI"
   },
   "source": [
    "## Load necessary libraries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "mC9K9Dpx1ztf"
   },
   "source": [
    "Check that the Google BigQuery library is installed and if not, install it. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 609
    },
    "colab_type": "code",
    "id": "RZUQtASG10xO",
    "outputId": "5612d6b0-9730-476a-a28f-8fdc14f4ecde"
   },
   "outputs": [],
   "source": [
   "%%bash\n",
    "sudo pip3 freeze | grep google-cloud-bigquery==1.6.1 || \\\n",
    "sudo pip3 install google-cloud-bigquery==1.6.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.cloud import bigquery"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "L0-vOB4y2BJM"
   },
   "source": [
    "## Source dataset\n",
    "\n",
    "Our dataset is hosted in [BigQuery](https://cloud.google.com/bigquery/). The CDC's Natality data has details on US births from 1969 to 2008 and is a publically available dataset; anyone with a GCP account has access. Click [here](https://console.cloud.google.com/bigquery?project=bigquery-public-data&p=publicdata&d=samples&t=natality&page=table) to access the dataset.\n",
    "\n",
    "The natality dataset is relatively large, at nearly 138 million rows and 31 columns, but it is simple to understand. The column `weight_pounds` is the target, the continuous value we’ll train a model to predict."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<h2> Explore data </h2>\n",
    "\n",
    "The data is natality data (record of births in the US). The goal is to predict the baby's weight given a number of factors about the pregnancy and the baby's mother.  Later, we will want to split the data into training and eval datasets. The hash of the year-month will be used for that -- this way, twins born on the same day won't end up in different cuts of the data. We'll first create a SQL query using the natality data after the year 2000."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    is_male,\n",
    "    mother_age,\n",
    "    plurality,\n",
    "    gestation_weeks,\n",
    "    FARM_FINGERPRINT(\n",
    "        CONCAT(\n",
    "            CAST(YEAR AS STRING),\n",
    "            CAST(month AS STRING)\n",
    "        )\n",
    "    ) AS hashmonth\n",
    "FROM\n",
    "    publicdata.samples.natality\n",
    "WHERE\n",
    "    year > 2000\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a BigQuery client that we can use throughout the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bq = bigquery.Client()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now examine the result of a BiqQuery call in a Pandas DataFrame using our newly created client."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# Call BigQuery and examine in dataframe\n",
    "df = bigquery.Client().query(query + \" LIMIT 100\").to_dataframe()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's get the set of all valid column names in the natality dataset. We can do this by accessing the `INFORMATION_SCHEMA` for the table from the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Query to get all column names within table schema\n",
    "sql = \"\"\"\n",
    "SELECT\n",
    "    column_name\n",
    "FROM\n",
    "    publicdata.samples.INFORMATION_SCHEMA.COLUMNS\n",
    "WHERE\n",
    "    table_name = \"natality\"\n",
    "\"\"\"\n",
    "\n",
    "# Send query through BigQuery client and store output to a dataframe\n",
    "valid_columns_df = bq.query(sql).to_dataframe()\n",
    "\n",
    "# Convert column names in dataframe to a set\n",
    "valid_columns_set = valid_columns_df[\"column_name\"].tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can print our valid columns set to see all of the possible columns we have available in the dataset. You could also find this information by going to the `Schema` tab when selecting the table in the [BigQuery UI](https://console.cloud.google.com/bigquery?project=bigquery-public-data&p=publicdata&d=samples&t=natality&page=table)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(valid_columns_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lab Task #1: Use BigQuery to explore natality dataset.\n",
    "Using the above code as an example, write a query to find the unique values for each of the columns and the count of those values for babies born after the year 2000.\n",
    "For example, we want to get these values:\n",
    "<pre>\n",
    "is_male\tnum_babies\tavg_wt\n",
    "False\t16245054\t7.104715\n",
    " True\t17026860\t7.349797\n",
    "</pre>\n",
    "This is important to ensure that we have enough examples of each data value, and to verify our hunch that the parameter has predictive value.\n",
    "\n",
    "Hint (highlight to see): <p style='color:white'>Use COUNT(), AVG() and GROUP BY. For example:\n",
    "<pre style='color:white'>\n",
    "SELECT\n",
    "  is_male,\n",
    "  COUNT(1) AS num_babies,\n",
    "  AVG(weight_pounds) AS avg_wt\n",
    "FROM\n",
    "  publicdata.samples.natality\n",
    "WHERE\n",
    "  year > 2000\n",
    "GROUP BY\n",
    "  is_male\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# TODO: Create function that gets distinct value statistics from BigQuery\n",
    "def get_distinct_values(valid_columns_set, column_name):\n",
    "    \"\"\"Gets distinct value statistics of BigQuery data column.\n",
    "\n",
    "    Args:\n",
    "        valid_columns_set: set, the set of all possible valid column names in\n",
    "            table.\n",
    "        column_name: str, name of column in BigQuery.\n",
    "    Returns:\n",
    "        Dataframe of unique values, their counts, and averages.\n",
    "    \"\"\"\n",
    "    assert column_name in valid_columns_set, (\n",
    "        \"{column_name} is not a valid column_name\".format(\n",
    "            column_name=column_name))\n",
    "\n",
    "    sql = \"\"\"\n",
    "    \"\"\"\n",
    "\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lab Task #2: Use Cloud AI Platform Notebook to plot explorations.\n",
    "\n",
    "Which factors seem to play a part in the baby's weight?\n",
    "\n",
    "<b>Bonus:</b> Draw graphs to illustrate your conclusions\n",
    "\n",
    "Hint (highlight to see): \n",
    "<p style='color:white'># TODO: Reusing the get_distinct_values function you just implemented, create function that plots distinct value statistics from BigQuery\n",
    "\n",
    "Hint (highlight to see): <p style='color:white'> The simplest way to plot is to use Pandas' built-in plotting capability\n",
    "<pre style='color:white'>\n",
    "df = get_distinct_values(valid_columns_set, column_name)\n",
    "df = df.sort_values(column_name)\n",
    "df.plot(x=column_name, y=\"num_babies\", kind=\"bar\", figsize=(12, 5))\n",
    "df.plot(x=column_name, y=\"avg_wt\", kind=\"bar\", figsize=(12, 5))\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: Create function that plots distinct value statistics from BigQuery\n",
    "def plot_distinct_values(valid_columns_set, column_name, logy=False):\n",
    "    \"\"\"Plots distinct value statistics of BigQuery data column.\n",
    "\n",
    "    Args:\n",
    "        valid_columns_set: set, the set of all possible valid column names in\n",
    "            table.\n",
    "        column_name: str, name of column in BigQuery.\n",
    "        logy: bool, if plotting counts in log scale or not.\n",
    "    \"\"\"\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make a bar plot to see `is_male` with `avg_wt` linearly scaled and `num_babies` logarithmically scaled."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# TODO: Plot is_male"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make a bar plot to see `mother_age` with `avg_wt` linearly scaled and `num_babies` linearly scaled."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# TODO: Plot mother_age"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make a bar plot to see `plurality` with `avg_wt` linearly scaled and `num_babies` logarithmically scaled."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# TODO: Plot plurality"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make a bar plot to see `gestation_weeks` with `avg_wt` linearly scaled and `num_babies` logarithmically scaled."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# TODO: Plot gestation_weeks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "All these factors seem to play a part in the baby's weight. Male babies are heavier on average than female babies. Teenaged and older moms tend to have lower weight babies. Multiple birth babies (twins, triplets, etc.) are lower weight than single birth babies. Preemies weigh in lower as do babies born to single moms. In addition, it is important to check whether you have enough data (number of babies) for each input value. Otherwise, the model prediction against input values that doesn't have enough data may not be reliable.\n",
    "<p>\n",
    "In the next notebooks, we will develop a machine learning model to combine all of these factors to come up with a prediction of a baby's weight."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lab Summary: \n",
    "In this lab, we used BigQuery to explore the data and used Cloud AI Platform Notebooks to plot data explorations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
